We study numerically the properties of the Bayesian perceptron through a gradient descent on the
optimal cost function. The theoretical distribution of stabilities is deduced. It predicts that the
optimal generalizer lies close to the boundary of the space of (error-free) solutions. The numerical
simulations are in good agreement with the theoretical distribution. The extrapolation of the
generalization error to infinite input space size agrees with the theoretical results. Finite size
corrections are negative and exhibit two different scaling regimes, depending on the training set
size. The variance of the generalization error vanishes for $N \to \infty$, confirming the property of
self-averaging.

PACS numbers: 87.10.+e, 02.50.-r, 05.20.-y



I. INTRODUCTION 

A neural network is able to infer an unknown rule from examples. We address specifically
classification tasks in which a single neuron (a perceptron), connected to $N$ input units through
weights $\mathbf{w} = (w_1, \ldots, w_N)$, attributes labels $\pm 1$ given by
$\sigma = \mathrm{sign}(\mathbf{w} \cdot \boldsymbol{\xi})$ to input patterns
$\boldsymbol{\xi} = (\xi_1, \ldots, \xi_N)$. A perceptron is able to correctly classify linearly
separable (LS) problems; the hyperplane orthogonal to $\mathbf{w}$ separates, in the input space, the
patterns given positive outputs from those given negative outputs. Given a set $L_\alpha$ of
$P = \alpha N$ examples, i.e. training patterns $\boldsymbol{\xi}^\mu$ ($\mu = 1, \ldots, P$) with their
corresponding class $\tau^\mu$, the process of finding the weights $\mathbf{w}$ is called learning.
Generally, if the problem is LS, there is a finite volume of error-free solutions in weight space.
This volume is called the version space.

Most of the learning algorithms proposed so far may be stated as the minimization of a cost function
or empirical risk $E(\mathbf{w}; L_\alpha)$ in weight space. The structure of the problem of learning
from examples allows for a statistical mechanics analysis, in which the cost function is considered
as an energy. The performance of the learning algorithm is calculated through thermal averages with
the Boltzmann distribution in weight space and quenched averages over all the possible training sets.
In the thermodynamic limit $N \to \infty$, $P \to \infty$ with $\alpha = P/N$ constant, the zero
temperature limit of these averages accounts for the typical behaviour of the algorithm. The fraction
of training errors $\epsilon_t$, the generalization error $\epsilon_g$, and the distribution
$\rho(\gamma)$ of distances of the training patterns to the separating hyperplane can be determined
with the assumption of self-averaging.

The minimization of the number of training errors, called the Gibbs algorithm, is not the best
learning strategy in the case of LS problems, because it picks one point in version space at random.
Its typical generalization error (see Eq. (6) below for the definition) vanishes with the size of the
training set like $\epsilon_g \approx 0.625/\alpha$ [?]. A more elaborate strategy is to look for those
weights in version space that maximize the distance of the separating hyperplane to its closest
training patterns [?]. These patterns are called the support vectors [?], and define the maximal
stability (or maximum margin) perceptron (MSP), which lies at the center of the version space, and
whose generalization error vanishes in the large $\alpha$ regime like
$\epsilon_g \approx 0.5005/\alpha$ [?]. As the training set only contains a small fraction of the
information needed to find the underlying rule generating the examples, there is a lower bound to
$\epsilon_g$, given by Bayes decision theory [7]. Bayesian performance may be implemented by what is
called a committee machine [?]: through the vote of a large number of perceptrons trained with the
Gibbs algorithm. The Bayesian generalization error vanishes like $\epsilon_g \approx 0.442/\alpha$ in
the limit of large $\alpha$. However, the convergence of the committee machine to the optimum is
guaranteed only in the limit of an infinite number of perceptrons. In order to circumvent the
complexity of the committee machine, several learning algorithms for single perceptrons, based on the
minimization of ad hoc cost functions, have been recently proposed [?]. In these approaches, the cost
function is sought within a given class of functions and has a free parameter which has to be
optimized for each value of $\alpha = P/N$, the reduced size of the training set. The generalization
performance of these algorithms is very close to the Bayesian optimal value. Some of them end up with
a finite fraction of training errors, suggesting that the optimal solution might lie outside the
version space, but it has been established that this is not the case [10]. More recently, the cost
function that minimizes the generalization error was determined through a variational approach, and
it was shown that its minimum endows the perceptron with the optimal, Bayesian, generalization
performance [?].

In this paper, after a somewhat different derivation of the optimal cost function, we determine the
typical distribution of distances of the training patterns to the Bayesian hyperplane, and we present
simulation results that confirm the theoretical predictions. We find that the optimal Bayesian
student lies close to the boundary of the version space. The finite size corrections to the
generalization error are negative and present two different scaling behaviours as a function of
$\alpha$.



II. THEORETICAL RESULTS 

In this section, for completeness, we present an alternative derivation of the optimal potential for
learning linearly separable tasks, and we deduce the distribution of distances of the training
patterns. The theoretical problem is formulated as follows: the probability that the classifier
assigns class $\sigma$ to pattern $\boldsymbol{\xi}$ after training with a set $L_\alpha$ of
$\alpha N$ examples is
$P(\sigma | \{\boldsymbol{\xi}, \mathbf{w}, L_\alpha\}) = \Theta(\sigma\, \mathbf{w} \cdot \boldsymbol{\xi})\, P(\mathbf{w}|L_\alpha)\, P(\boldsymbol{\xi})$,
where $\Theta(x)$ is the Heaviside function. In general the posterior probability
$P(\mathbf{w}|L_\alpha)$ is determined through the minimization of a cost function
$E(\mathbf{w}; L_\alpha)$ which depends on the training set. In order to derive the properties of a
training algorithm minimizing a cost function, it is useful to introduce a fictitious temperature
$1/\beta$, and consider the finite temperature probability

$$P(\mathbf{w}|L_\alpha; \beta) = p(\mathbf{w})\, \frac{\exp[-\beta E(\mathbf{w}; L_\alpha)]}{Z(L_\alpha; \beta)}, \qquad (1)$$



where $p(\mathbf{w})$, called the prior probability density, allows constraints to be imposed on the
weights, and $Z(L_\alpha; \beta)$ is the partition function

$$Z(L_\alpha; \beta) = \int \exp[-\beta E(\mathbf{w}; L_\alpha)]\, p(\mathbf{w})\, d\mathbf{w}. \qquad (2)$$



The typical behaviour of any intensive quantity $X(\mathbf{w})$ is obtained under the assumption of
self-averaging through the quenched average over all the possible training sets $L_\alpha$ of the
same size $\alpha$, in the thermodynamic limit $N \to \infty$ (taken at constant $\alpha = P/N$) and
in the zero temperature limit:

$$\ll X \gg = \lim_{N \to \infty} \int P(L_\alpha)\, dL_\alpha \lim_{\beta \to \infty} \int X(\mathbf{w})\, P(\mathbf{w}|L_\alpha; \beta)\, d\mathbf{w}, \qquad (3)$$

where $\ll \cdots \gg$ stands for the double average, over the weights $\mathbf{w}$ and the training
sets $L_\alpha$.

If the cost function has a unique minimum $\mathbf{w}^*(L_\alpha)$ (this may not be the case, as
happens when the cost function is the number of training errors), then
$P(\mathbf{w}|L_\alpha) = \delta(\mathbf{w} - \mathbf{w}^*(L_\alpha))$. In this case, the average
between brackets in (3) is reduced to $X(L_\alpha, N)$, which is a random variable that depends on the
particular training set realization through $\mathbf{w}^*(L_\alpha)$. The width of its probability
distribution function is expected to vanish in the thermodynamic limit, i.e. all the training sets
endow the perceptron with the same properties, with probability one. This property is called
self-averaging. As a consequence, $X(L_\alpha, N)$ may be calculated by averaging over all the
possible training sets to get rid of the particular training set realization. The replica method of
statistical mechanics has been developed to cope with the averages over so-called quenched variables,
which in this case correspond to the realizations $L_\alpha$.

Consider the paradigm of learning an LS rule from examples: for each pattern $\boldsymbol{\xi}^\mu$,
a teacher perceptron of weight vector $\mathbf{v}$ defines the corresponding target
$\tau^\mu = \mathrm{sign}(\mathbf{v} \cdot \boldsymbol{\xi}^\mu)$. As usual, we assume that the
$P = \alpha N$ training patterns are independently selected with a probability density function
$P(\boldsymbol{\xi}^\mu)$, and that the cost function the student's weights $\mathbf{w}$ have to
minimize is an additive function of the examples,

$$E(\mathbf{w}; L_\alpha) = \sum_{\mu=1}^{P} V(\gamma^\mu), \qquad (4)$$

where the potential $V$ depends on the training pattern $\mu$ and its class through the stability

$$\gamma^\mu = \tau^\mu\, \frac{\mathbf{w} \cdot \boldsymbol{\xi}^\mu}{\sqrt{\mathbf{w} \cdot \mathbf{w}}}. \qquad (5)$$



As the outputs $\sigma$ and $\tau$ are invariant under the transformations
$\mathbf{w} \to a\mathbf{w}$, $\mathbf{v} \to a'\mathbf{v}$ with $a, a' > 0$, the teacher's and
student's weight spaces may be restricted to the hyperspheres $\mathbf{w}^2 = N$ and
$\mathbf{v}^2 = N$ respectively without any loss of generality. Most training algorithms can be cast
in the form (4). If the minimum of (4) is unique and $V(\gamma)$ is differentiable, the weights
$\mathbf{w}$ can be obtained by a gradient descent. This is not the case for the Gibbs algorithm,
whose potential is the non-differentiable error-counting function $V_G(\gamma) = \Theta(-\gamma)$.
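The stability of Eq. (5) and the error-counting potential of the Gibbs algorithm can be illustrated with a minimal Python sketch. The function names and the toy data are hypothetical and only serve the illustration; patterns with exactly zero stability are counted as correct, which is a convention chosen here.

```python
import math

def stabilities(w, patterns, targets):
    """Return gamma^mu = tau^mu * (w . xi^mu) / sqrt(w . w) for every example."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return [tau * sum(wi * x for wi, x in zip(w, xi)) / norm
            for xi, tau in zip(patterns, targets)]

def gibbs_cost(gammas):
    """Error-counting cost: number of examples with negative stability."""
    return sum(1 for gamma in gammas if gamma < 0)

# Toy data (hypothetical, for illustration only).
w = [3.0, 4.0]
patterns = [[1.0, 0.0], [0.0, 1.0], [1.0, -1.0]]
targets = [1, -1, 1]
gammas = stabilities(w, patterns, targets)
n_errors = gibbs_cost(gammas)
```

Rescaling `w` by any positive constant leaves `gammas` unchanged, which is the invariance used above to restrict the weights to a hypersphere.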

The generalization error $\epsilon_g(\mathbf{w})$ is the probability that a pattern, chosen at random
with the same probability density as the training patterns, is misclassified by the student
perceptron. Its typical value depends on the overlap $R = \ll \mathbf{v} \cdot \mathbf{w} \gg / N$
between the student and the teacher weight vectors $\mathbf{w}$ and $\mathbf{v}$,

$$\epsilon_g \equiv \ll \epsilon_g(\mathbf{w}) \gg = \frac{1}{\pi} \arccos R. \qquad (6)$$
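The relation between overlap and generalization error can be checked with a short Monte Carlo sketch. It assumes isotropic Gaussian input patterns and works in two dimensions with unit-norm teacher and student, so that $R$ is simply the cosine of the angle between them; the function names are hypothetical.

```python
import math
import random

def generalization_error(R):
    """Eq. (6): probability that student and teacher disagree, for overlap R."""
    return math.acos(R) / math.pi

def monte_carlo_error(R, n_samples=100_000, seed=0):
    """Estimate the disagreement rate for Gaussian patterns in two dimensions."""
    rng = random.Random(seed)
    v = (1.0, 0.0)                      # teacher (unit norm)
    w = (R, math.sqrt(1.0 - R * R))     # student with overlap R (unit norm)
    errors = 0
    for _ in range(n_samples):
        x = (rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0))
        teacher_field = v[0] * x[0] + v[1] * x[1]
        student_field = w[0] * x[0] + w[1] * x[1]
        if teacher_field * student_field < 0:
            errors += 1
    return errors / n_samples

exact = generalization_error(0.5)
estimate = monte_carlo_error(0.5)
```

The estimate fluctuates around the exact value with a statistical error of order $1/\sqrt{n}$ in the number of samples.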



We assume that the training patterns are identically distributed random variables whose components
have zero mean, $\langle \xi_i^\mu \rangle = 0$, and unit variance,
$\langle \xi_i^\mu \xi_j^\nu \rangle = \delta_{\mu\nu}\delta_{ij}$. The free energy per neuron is
averaged over the training sets with the replica method under the assumption of replica symmetry,
which will be shown to be stable. The extremum conditions on the free energy that determine the
overlap $R$ are [?]:

$$1 - R^2 = 2\alpha \int_{-\infty}^{+\infty} H\!\left(\frac{-Rt}{\sqrt{1-R^2}}\right) (\lambda(t;c) - t)^2\, Dt, \qquad (7a)$$

$$R = 2\alpha \int_{-\infty}^{+\infty} \exp\!\left(-\frac{t^2}{2(1-R^2)}\right) (\lambda(t;c) - t)\, \frac{dt}{2\pi\sqrt{1-R^2}}, \qquad (7b)$$

with $Du = \exp(-u^2/2)\, du/\sqrt{2\pi}$ and
$H(t) = \int_t^\infty Du = (1/2)\, \mathrm{erfc}(t/\sqrt{2})$. The parameter $c$ is the
$\beta \to \infty$ limit of $\beta(1 - q)$, where $q$ is the overlap between two solutions in the
student's space. If the cost function (4) has a single global minimum, $q \to 1$ and $c$ is finite.
The function $\lambda(t;c)$, determined by the saddle point equation of the free energy for
$\beta \to \infty$, minimizes $W(\lambda) = V(\lambda) + (\lambda - t)^2/2c$ with respect to
$\lambda$. For cost functions having continuous derivatives, $\lambda(t;c)$ satisfies:

$$\lambda + c\, \frac{dV}{d\lambda}(\lambda) = t. \qquad (8)$$



The solution to (7) has to verify the necessary condition for local stability of the replica
symmetric solution [?]:

$$2\alpha \int_{-\infty}^{+\infty} Dt\, H\!\left(\frac{-Rt}{\sqrt{1-R^2}}\right) (\lambda'(t;c) - 1)^2 < 1, \qquad (9)$$

where $\lambda' = d\lambda/dt$. It has recently been shown that (9) can only be satisfied if (8) is
invertible, which imposes [?]:

$$c V'' \equiv c\, \frac{d^2 V}{d\lambda^2} > -1. \qquad (10)$$



If $V(\lambda)$ is known, $R$ can be calculated through the solution of equations (7). Instead of
solving this direct problem, we are interested in finding the best potential within the class of
functions having continuous derivatives. Instead of using the Schwarz inequality as in [?], we show
that a straightforward functional maximization of $R$ leads to the same result. As only the product
$\beta V$ appears in the partition function (2), we can multiply the potential $V$ and the
temperature $1/\beta$ by the same constant $a > 0$, leaving $Z(L_\alpha; \beta)$ invariant. This
transformation changes $c \to c/a$ in (8) and (10), leaving $R$ unchanged. Thus, we may impose
$c = 1$ throughout without any loss of generality, which amounts to choosing the energy units.

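A further simplification arises from considering $R$ as a functional of $V$ through
$g(t) \equiv \lambda(t) - t$. For then we can write:

$$g(t) = -V'(\lambda(t)), \qquad (11)$$

where $\lambda(t)$ is the solution to (8). Equations (7) and (9) become respectively:

$$1 - R^2 = \alpha f(R, g), \qquad f(R, g) \equiv 2 \int_{-\infty}^{+\infty} g^2(t)\, H\!\left(\frac{-Rt}{\sqrt{1-R^2}}\right) Dt, \qquad (12a)$$

$$R = \alpha h(R, g), \qquad h(R, g) \equiv 2 \int_{-\infty}^{+\infty} \exp\!\left(-\frac{t^2}{2(1-R^2)}\right) g(t)\, \frac{dt}{2\pi\sqrt{1-R^2}}, \qquad (12b)$$

$$2\alpha \int_{-\infty}^{+\infty} (g'(t))^2\, H\!\left(\frac{-Rt}{\sqrt{1-R^2}}\right) Dt < 1. \qquad (13)$$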



Given $\alpha$, equations (12) and (13) must be simultaneously verified by the function $g(t)$ that
maximizes $R$. We look for the solution $g(t)$ that maximizes (12b), with (12a) considered as a
constraint, introduced through a Lagrange multiplier $\eta$. As it is not easy to impose inequality
(13) as a supplementary constraint, we extremize
$R = \alpha h(R, g) + \eta\, [1 - R^2 - \alpha f(R, g)]$ and we will show that our result is
consistent, i.e. that the $g(t)$ obtained does indeed verify conditions (10) and (13). The function
$g(t)$ that maximizes $R$ satisfies $\delta R/\delta g(t) = 0$, which implies
$\delta h/\delta g = \eta\, \delta f/\delta g$, where $\delta(\ldots)/\delta g$ stands for the
functional derivative of $(\ldots)$ with respect to $g(t)$. It is straightforward to deduce the
expression for $g$:

$$g(t) = \frac{\exp\!\left(-\dfrac{R^2 t^2}{2(1-R^2)}\right)}{2\eta \sqrt{2\pi(1-R^2)}\; H\!\left(\dfrac{-Rt}{\sqrt{1-R^2}}\right)}, \qquad (14)$$
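The intermediate step can be written out explicitly. The sketch below assumes that $f$ and $h$ are the functionals appearing in (12a) and (12b), restated in the first line, and uses $Dt = e^{-t^2/2}\,dt/\sqrt{2\pi}$.

```latex
% Functionals of g entering the constrained extremization (assumed forms of (12a), (12b)):
f = 2\int g^2(t)\, H\!\left(\tfrac{-Rt}{\sqrt{1-R^2}}\right) \frac{e^{-t^2/2}}{\sqrt{2\pi}}\, dt ,
\qquad
h = 2\int g(t)\, \frac{e^{-t^2/[2(1-R^2)]}}{2\pi\sqrt{1-R^2}}\, dt .

% Functional derivatives with respect to g(t):
\frac{\delta h}{\delta g(t)} = \frac{e^{-t^2/[2(1-R^2)]}}{\pi\sqrt{1-R^2}} ,
\qquad
\frac{\delta f}{\delta g(t)} = 4\, g(t)\, H\!\left(\tfrac{-Rt}{\sqrt{1-R^2}}\right) \frac{e^{-t^2/2}}{\sqrt{2\pi}} .

% Stationarity, \delta h/\delta g = \eta\, \delta f/\delta g, solved for g(t),
% using  t^2/[2(1-R^2)] - t^2/2 = R^2 t^2/[2(1-R^2)] :
g(t) = \frac{\exp\!\left(-\frac{R^2 t^2}{2(1-R^2)}\right)}
            {2\eta\sqrt{2\pi(1-R^2)}\; H\!\left(\tfrac{-Rt}{\sqrt{1-R^2}}\right)} .
```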



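where $\eta$ and $R$ depend implicitly on $\alpha$. After introduction of (14) into (12), we find the
solutions $R(\alpha) = \mathcal{R}$ and $\eta(\alpha)$:

$$\frac{\mathcal{R}^2}{\sqrt{1-\mathcal{R}^2}} = \frac{\alpha}{\pi} \int_{-\infty}^{+\infty} Dt\, \frac{\exp(-\mathcal{R}^2 t^2/2)}{H(-\mathcal{R} t)}, \qquad (15a)$$

$$\eta^{-1}(\alpha) = 2\, \frac{1-\mathcal{R}^2}{\mathcal{R}}. \qquad (15b)$$

They determine, through (14), the function $g$ for each value of $\alpha$:

$$g(t; \alpha) = T^2\, \frac{d}{dt} \ln H\!\left(-\frac{t}{T}\right), \qquad (16)$$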
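As an illustration, the self-consistent equation for the Bayesian overlap, $\mathcal{R}^2/\sqrt{1-\mathcal{R}^2} = (\alpha/\pi)\int Dt\, \exp(-\mathcal{R}^2 t^2/2)/H(-\mathcal{R}t)$, can be solved numerically. The following is a minimal sketch, not the authors' procedure: it assumes a trapezoidal quadrature on $[-10, 10]$ and a bisection on $\mathcal{R} \in (0, 1)$, which in turn assumes that the equation has a single root in that interval. All function names are hypothetical.

```python
import math

def H(x):
    """Gaussian tail integral H(x) = erfc(x / sqrt(2)) / 2."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))

def overlap_integral(R, t_max=10.0, n=2000):
    """Trapezoidal estimate of the integral of Dt exp(-R^2 t^2 / 2) / H(-R t)."""
    step = 2.0 * t_max / n
    total = 0.0
    for i in range(n + 1):
        t = -t_max + i * step
        value = math.exp(-0.5 * t * t * (1.0 + R * R)) / (math.sqrt(2.0 * math.pi) * H(-R * t))
        total += 0.5 * value if i in (0, n) else value
    return total * step

def bayes_overlap(alpha, tol=1e-12):
    """Bisection for the root of R^2 / sqrt(1 - R^2) - (alpha / pi) * integral(R)."""
    def mismatch(R):
        return R * R / math.sqrt(1.0 - R * R) - alpha / math.pi * overlap_integral(R)
    lo, hi = 0.0, 1.0 - 1e-15   # mismatch(lo) < 0 < mismatch(hi)
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if mismatch(mid) < 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def bayes_generalization_error(alpha):
    """Generalization error arccos(R) / pi at the Bayesian overlap."""
    return math.acos(bayes_overlap(alpha)) / math.pi
```

For large `alpha`, the product `alpha * bayes_generalization_error(alpha)` can be compared with the asymptotic coefficient 0.442 quoted in the Introduction.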



where we wrote $g(t; \alpha)$ to stress the $\alpha$ dependence, and
$T^2 = (1 - \mathcal{R}^2)/\mathcal{R}^2$. It is straightforward to verify that $g(t; \alpha)$
satisfies the stability condition (13) for all $\alpha$, justifying our assumption of replica
symmetry. A comparison of (15a) with previous results [?] shows that
$\mathcal{R}(\alpha) = \sqrt{R_G(\alpha)}$, where $R_G$ corresponds to the Gibbs algorithm. The same
equation relates the Bayesian generalizer to the Gibbs algorithm, as was demonstrated by Opper and
Haussler [?] with a method that makes explicit use of the committee machine architecture. The
potential $V(\lambda)$ may be obtained by integration of (11):

$$V(\lambda) = \int_{t(\lambda)}^{+\infty} g(t'; \alpha)\, \frac{d\lambda}{dt'}\, dt', \qquad (17)$$



where $t(\lambda)$ is given by the inversion of $\lambda = t + g(t; \alpha)$, and we imposed that
$V(+\infty) = 0$. This optimal potential endows the perceptron with Bayesian generalization
performance and depends implicitly on the size of the training set through $T$. It presents a
logarithmic divergence $V(\lambda) \approx -T^2 \ln(\lambda)$ for $\lambda \to 0^+$. As
$V(\lambda) = \infty$ for negative stabilities to ensure that $\lambda(t)$ is single valued, the
optimal weight vector lies within the version space. For $\lambda \to \infty$,
$V(\lambda) \approx T^3 \exp(-\lambda^2/2T^2)/\lambda$. Thus, the range of the potential decreases
for increasing values of $\alpha$ and vanishes as $\alpha \to \infty$, showing that the most relevant
patterns for learning are located within a narrow window, on both sides of the student's hyperplane,
whose width shrinks like $T$ for increasing $\alpha$ ($T \sim 1/\alpha$ for $\alpha \gg 1$). With
this cost function, the optimal generalizer may be found by a simple gradient descent, with neither
the need to train an infinite number of perceptrons for implementing a committee machine, as was
suggested by Opper and Haussler [?], nor to determine a large number of 'samplers' of the version
space, as proposed by Watkin [?].
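Only the derivative of the potential is needed for a gradient descent, and it follows from $dV/d\lambda = -g(t(\lambda))$ with $\lambda = t + g(t)$. The sketch below is one possible way to evaluate it, under stated assumptions: $g(t) = T\,\varphi(t/T)/H(-t/T)$ with $\varphi$ the standard normal density (the explicit form of $T^2\, d\ln H(-t/T)/dt$), a standard asymptotic expansion of the ratio $\varphi/H$ at large argument to avoid underflow, and a bisection for the inversion. The parameter `T` is taken as an input; all names are hypothetical.

```python
import math

SQRT_2PI = math.sqrt(2.0 * math.pi)

def inverse_mills(x):
    """phi(x) / H(x), with phi the standard normal density and H its tail integral."""
    if x > 20.0:
        # Asymptotic expansion, used where erfc would lose accuracy or underflow.
        return x + 1.0 / x - 2.0 / x ** 3
    return math.exp(-0.5 * x * x) / (SQRT_2PI * 0.5 * math.erfc(x / math.sqrt(2.0)))

def g(t, T):
    """g(t) = T * phi(t/T) / H(-t/T), always positive."""
    return T * inverse_mills(-t / T)

def lam(t, T):
    """Stability after learning, lambda(t) = t + g(t); increasing in t, positive."""
    return t + g(t, T)

def t_of_lambda(lmbda, T, iterations=200):
    """Invert lambda(t) = lmbda (lmbda > 0) by bisection."""
    lo, hi = -T * T / lmbda, lmbda   # lam(lo) < lmbda < lam(hi)
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        if lam(mid, T) < lmbda:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def dV_dlambda(lmbda, T):
    """Derivative of the potential at positive stability lmbda."""
    return -g(t_of_lambda(lmbda, T), T)
```

For small positive `lmbda` the returned derivative approaches `-T**2 / lmbda`, in line with the logarithmic divergence of the potential at vanishing stability.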

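Once the potential is known, it is straightforward to calculate the distribution of stabilities of
the training set:

$$\rho(\gamma) = \ll \frac{1}{P} \sum_{\mu=1}^{P} \delta(\gamma - \gamma^\mu) \gg. \qquad (18)$$

Its general expression is [?]:

$$\rho(\gamma) = 2 \int_{-\infty}^{+\infty} Dt\, H\!\left(-\frac{t}{T}\right) \delta[\lambda(t) - \gamma], \qquad (19)$$

with $\lambda(t) = t + g(t; \alpha)$. In terms of $t(\gamma)$,

$$\rho(\gamma) = \sqrt{\frac{2}{\pi}}\, \exp\!\left(-\frac{t^2(\gamma)}{2}\right) H\!\left(-\frac{t(\gamma)}{T}\right) \frac{dt}{d\gamma}(\gamma), \qquad (20)$$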



which depends on $\alpha$ through $T$. In the present case, as all the patterns have positive
stabilities, $\rho(\gamma)$ is the distribution of the distances of the training patterns to the
student's hyperplane. Distributions obtained through a numerical inversion of $\lambda(t)$, for
several values of $\alpha$, are plotted in Fig. 1 [?]. The density of patterns is exponentially
small, $\rho(\gamma) \approx [T/(\pi\gamma)] \exp[-T^2/2(\mathcal{R}\gamma)^2]$, at small distance to
the hyperplane. It increases with $\gamma$ up to a maximum at $\gamma_M(\alpha)$. At larger $\gamma$
there is a crossover to a Gaussian distribution,
$\rho(\gamma) \approx \sqrt{2/\pi}\, \exp(-\gamma^2/2)$, identical to the teacher's one. Both
$\gamma_M(\alpha)$ and the crossover distance get closer to the hyperplane with increasing $\alpha$.
In the large $\alpha$ limit, both quantities vanish like $1/\alpha$, with
$\gamma_M \simeq 1.769/\alpha$. Thus, the region of disagreement between the student's and the
teacher's distributions decreases for increasing size of the training set. In the limit
$\alpha \to \infty$, the Bayesian distribution is identical to the teacher's one.
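The distribution of stabilities can be evaluated without an explicit inversion by treating $t$ as a parameter: each $t$ gives a point $\gamma = \lambda(t)$ and a density $\rho = \sqrt{2/\pi}\, e^{-t^2/2} H(-t/T)/\lambda'(t)$. The sketch below assumes $g(t) = T\varphi(t/T)/H(-t/T)$, for which differentiation gives $\lambda'(t) = 1 - g(t)\lambda(t)/T^2$; the value of `T` is an input and the names are hypothetical.

```python
import math

SQRT_2PI = math.sqrt(2.0 * math.pi)

def H(x):
    """Gaussian tail integral."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))

def g(t, T):
    """g(t) = T * phi(t/T) / H(-t/T), with an asymptotic form far in the tail."""
    x = -t / T
    if x > 20.0:
        return T * (x + 1.0 / x - 2.0 / x ** 3)
    return T * math.exp(-0.5 * x * x) / (SQRT_2PI * H(x))

def rho_point(t, T):
    """Return (gamma, rho(gamma)) for the parameter value t."""
    gt = g(t, T)
    gamma = t + gt
    dlam_dt = 1.0 - gt * gamma / (T * T)      # derivative of lambda(t) = t + g(t)
    rho = math.sqrt(2.0 / math.pi) * math.exp(-0.5 * t * t) * H(-t / T) / dlam_dt
    return gamma, rho

def rho_curve(T, n=4000, t_max=8.0):
    """Sample the curve on a uniform grid in t, from -8T to t_max."""
    t_min = -8.0 * T
    return [rho_point(t_min + (t_max - t_min) * i / n, T) for i in range(n + 1)]

def normalization(curve):
    """Trapezoidal integral of rho over gamma along the sampled curve."""
    return sum(0.5 * (r0 + r1) * (g1 - g0)
               for (g0, r0), (g1, r1) in zip(curve, curve[1:]))
```

A quick consistency check is that `normalization(rho_curve(T))` is close to one and that, at large `t`, the density approaches the Gaussian form quoted in the text.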

It is worthwhile to compare the present results with the MSP, whose weight vector is the one with
maximal distance from all the hyperplanes that define the version space. The corresponding
distribution $\rho(\gamma)$ presents a gap for $\gamma < \kappa$, and a $\delta$ peak at
$\gamma = \kappa$, which is precisely half the smallest width of the version space. In the large
$\alpha$ limit, $\kappa \simeq 1.004/\alpha$ is smaller than $\gamma_M$. The fact that the Bayesian
student has patterns at vanishing distance from the hyperplane, and has most patterns at distances
larger than $\kappa$, allows us to conclude that its weight vector lies close to the boundary of the
version space. It has been shown [?] that the Bayesian weight vector is the barycenter of the
(strictly convex) version space. Our result means that the barycenter of the version space is far
from its center, which is rather surprising, and might indicate that the version space is highly
non-spherical. Notice that the teacher weight vector lies even closer to the version space boundary,
as it has a finite distribution of stabilities for all $\gamma > 0$. This explains why some
potentials recently proposed may reach a generalization error lower than the MSP, in spite of the
fact that they find a solution outside the version space, i.e. without correctly learning the
complete training set.



III. SIMULATION RESULTS 

The theoretical results of the preceding section were obtained in the thermodynamic limit,
$N \to \infty$, $P \to \infty$, with $\alpha = P/N$ finite. In this section we present results of
thorough numerical simulations that confirm the theoretical predictions, and are precise enough to
determine the finite size corrections.

We describe first our implementation of the learning procedure. Given a training set, the optimal
student is found by a gradient descent on the cost function (4) with potential (17). In practice,
only the derivative of the potential is needed, and we do not need to perform the integration in
(17). As $dV/d\lambda$ is the function $-g(t; \alpha)$ defined by equation (16), evaluated at
$t = t(\lambda)$, we only have to invert the equation $\lambda(t) = t + g(t; \alpha)$. We calculated
$dV/d\lambda$ numerically for each value of $\alpha$ considered. As the optimal potential diverges
for negative stabilities, the minimization has to be started with a weight vector $\mathbf{w}(0)$
already inside the version space. In our simulations, we determined $\mathbf{w}(0)$ by minimization
of the cost function (4) with potential $V(\gamma) = 1 - \tanh(\beta\gamma/2)$, in which the value of
$\beta$ has to be optimally tuned [?]. We used the implementation called Minimerror [15], which finds
the best value of $\beta$ together with the weights $\mathbf{w}(0)$ through a deterministic
annealing. Starting from $\mathbf{w}(0)$, the weights are iteratively modified through

$$\tilde{\mathbf{w}} = \mathbf{w}(k) - \epsilon(k)\, \delta\mathbf{w}, \qquad (21a)$$

$$\delta\mathbf{w} = \sum_{\mu} \frac{dV}{d\gamma}(\gamma^\mu)\, \tau^\mu \boldsymbol{\xi}^\mu, \qquad (21b)$$

$$\mathbf{w}(k+1) = \sqrt{N}\, \frac{\tilde{\mathbf{w}}}{\sqrt{\tilde{\mathbf{w}} \cdot \tilde{\mathbf{w}}}}, \qquad (21c)$$



where $\gamma^\mu$ is the stability (5) of pattern $\mu$. Actually, the derivative
$\partial E/\partial \mathbf{w}$ has two terms, and only one of them is taken into account in
equation (21b). The neglected term, which contributes to keeping $\mathbf{w} \cdot \mathbf{w}$
constant only to first order in $\epsilon$, has been replaced by the normalization (21c). A
straightforward calculation shows that the component of $\delta\mathbf{w}$ (eq. (21b)) orthogonal to
$\mathbf{w}(k)$,
$\delta\mathbf{w}_\perp = \delta\mathbf{w} - \mathbf{w}(k)\, \delta\mathbf{w} \cdot \mathbf{w}(k)/N$,
is proportional to $\partial E/\partial \mathbf{w}$. Thus, at convergence,
$\delta w_\perp^2 = \delta\mathbf{w}_\perp \cdot \delta\mathbf{w}_\perp$ vanishes. Actually, the
stopping condition in all our simulations was
The variable learning rate $\epsilon(k)$, introduced to speed up the convergence, is determined as
follows: at each iteration, we calculate (21a) for three different values of $\epsilon$:
$\epsilon(k-1)/2$, $\epsilon(k-1)$, and $5\epsilon(k-1)$. The value $\epsilon(k-1)/2$ should prevent
the oscillations that may appear for too large learning rates, whereas $5\epsilon(k-1)$ accelerates
the convergence in regions where the potential is flat. At each iteration, we keep for $\epsilon(k)$
the value that minimizes $\delta w_\perp^2$. With this procedure, the initialization of $\epsilon$ is
irrelevant; we used $\epsilon(0) = 10^{-2}$ in all our tests.
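The update with a three-candidate learning rate can be sketched as follows. This is an interpretation under stated assumptions, not the authors' code: the direction is taken as $\delta\mathbf{w} = \sum_\mu V'(\gamma^\mu)\tau^\mu\boldsymbol{\xi}^\mu$, the three trial rates are compared through the squared orthogonal component of $\delta\mathbf{w}$ evaluated at the trial weights, and a smooth stand-in potential $V(\gamma) = e^{-\gamma}$ replaces the optimal one so that the block is self-contained. The data are synthetic and all names are hypothetical.

```python
import math
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def delta_w(w, patterns, targets, dV):
    """Direction sum_mu V'(gamma^mu) tau^mu xi^mu (assumed form of the update)."""
    norm = math.sqrt(dot(w, w))
    dw = [0.0] * len(w)
    for xi, tau in zip(patterns, targets):
        coef = dV(tau * dot(w, xi) / norm) * tau
        for i, x in enumerate(xi):
            dw[i] += coef * x
    return dw

def perp_norm2(w, dw):
    """Squared norm of the component of dw orthogonal to w."""
    proj = dot(dw, w) / dot(w, w)
    perp = [d - proj * wi for d, wi in zip(dw, w)]
    return dot(perp, perp)

def step(w, eps, dw):
    """Move against dw, then renormalize so that w . w = N."""
    trial = [wi - eps * d for wi, d in zip(w, dw)]
    scale = math.sqrt(len(w) / dot(trial, trial))
    return [scale * x for x in trial]

def descend(w, patterns, targets, dV, eps=1e-2, n_iter=50):
    """Iterate, keeping among eps/2, eps, 5*eps the rate with the smallest perp_norm2."""
    for _ in range(n_iter):
        dw = delta_w(w, patterns, targets, dV)
        trials = []
        for e in (eps / 2.0, eps, 5.0 * eps):
            w_new = step(w, e, dw)
            score = perp_norm2(w_new, delta_w(w_new, patterns, targets, dV))
            trials.append((score, e, w_new))
        _, eps, w = min(trials, key=lambda item: item[0])
    return w, eps

# Synthetic demo: binary patterns, random teacher, stand-in potential V = exp(-gamma).
rng = random.Random(0)
N, P = 5, 10
teacher = [rng.gauss(0.0, 1.0) for _ in range(N)]
patterns = [[rng.choice((-1.0, 1.0)) for _ in range(N)] for _ in range(P)]
targets = [1 if dot(teacher, xi) > 0 else -1 for xi in patterns]
w0 = step([rng.gauss(0.0, 1.0) for _ in range(N)], 0.0, [0.0] * N)  # normalized start
w, eps = descend(w0, patterns, targets, lambda gamma: -math.exp(-gamma), n_iter=20)
```

With the optimal potential, the starting point would additionally have to lie inside the version space, since that potential diverges for negative stabilities.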

In our simulations, we determined the generalization error $\epsilon_g(\alpha, N)$ and the
distribution of stabilities $\rho(\gamma; \alpha, N)$ as a function of $N$ and $\alpha$. Given
$\alpha$ and $N$, we generated training sets $L_\alpha$ of $P = \alpha N$ binary patterns. The
components of the input patterns $\boldsymbol{\xi}^\mu$ of each training sample were chosen at random
with probability $p(\xi_i^\mu = 1) = p(\xi_i^\mu = -1) = 1/2$ for all $1 \le i \le N$ and
$1 \le \mu \le P$. The corresponding outputs
$\tau^\mu = \mathrm{sign}(\mathbf{v} \cdot \boldsymbol{\xi}^\mu)$ are determined by a randomly
selected teacher of normalized weights $\mathbf{v}$ ($\mathbf{v} \cdot \mathbf{v} = N$). We made
simulations for $\alpha = 1, 2, 4, 6, 8, 10$ and $14$, and we considered at least seven different
values of $N$ for each value of $\alpha$. Each training set was learnt with the optimal potential
using (21), as explained before. The overlap between the obtained normalized weights
$\mathbf{w}^*(L_\alpha)$ and the teacher's weights $\mathbf{v}$,
$R(L_\alpha, N) = \mathbf{w}^* \cdot \mathbf{v}/N$, determines the generalization error of the
student perceptron, $\epsilon_g(L_\alpha, N) = \arccos[R(L_\alpha, N)]/\pi$.

We determined the generalization error for each pair $(\alpha, N)$, averaged over $M(\alpha, N)$
training sets, $\epsilon_g(\alpha, N) = \sum_{L_\alpha} \epsilon_g(L_\alpha, N)/M(\alpha, N)$. The
number of samples $M(\alpha, N)$ was chosen large enough to have a good precision in the
extrapolation to $1/N \to 0$. Values of $M(\alpha, N)$ ranging from 500 to 20 000 (the larger number
of samples corresponding to the smaller values of $P = \alpha N$) were used. Most of the simulations
were done on a parallel computer that allows for 64 samples to be processed simultaneously. The
obtained values of $\epsilon_g(\alpha, N)$ are represented on figure 2 as a function of $1/N$. All
the investigated values of $\alpha$ show the same behaviour, and only some of them are reported on
the figure for reasons of clarity. The generalization errors are linear in $1/N$ because for each
$\alpha$ we only considered values of $N$ large enough for the second order corrections in $1/N$ to
be negligible. The fits to the numerical results extrapolate correctly to the theoretical values
$\epsilon_g(\alpha)$ obtained in the thermodynamic limit $N \to \infty$, $P \to \infty$ with
$\alpha = P/N$ constant. The finite size corrections are negative, meaning that in finite dimension
the expected generalization error is lower than predicted by the theory. This result can be
understood if one considers the information content of the training set instead of its size. As the
number of possible training patterns is $2^N$, the training set carries (on average) a fraction of
information $\alpha N/2^N$ which, at constant $\alpha$, is larger the smaller $N$. Moreover, given
$\alpha$, there is a value $N_\alpha$ small enough that $\alpha N_\alpha > 2^{N_\alpha}$, i.e. such
that all the possible patterns belong to the training set. One expects that
$\epsilon_g(\alpha, N_\alpha) = 0$, and that $\epsilon_g(\alpha, N)$ increases smoothly for
increasing $N$ to reach $\epsilon_g(\alpha, \infty)$ from below.

The variance of the generalization error,
$\sigma_g^2(\alpha, N) = \sum_{L_\alpha} (\epsilon_g(L_\alpha, N) - \epsilon_g(\alpha, N))^2/M(\alpha, N)$,
is represented as a function of $1/N$ on figure 3 for all the values of $\alpha$ considered. The fact
that all the lines extrapolate to zero shows that, in the thermodynamic limit, the distribution of
$\epsilon_g(L_\alpha, N)$ is a delta function: any randomly selected training set corresponding to
the same $\alpha$ endows the perceptron with the same typical generalization error, with probability
one. In other words, the hypothesis of self-averaging, underlying the statistical mechanics
calculations, is correct.

At finite size, the average generalization error and its variance depend on $P = \alpha N$. To first
order in $1/P$, we may write:

$$\epsilon_g(\alpha, N) = \epsilon_g(\alpha, \infty) - \phi(\alpha)/P, \qquad (22)$$

$$\sigma_g^2(\alpha, N) = \psi(\alpha)/P. \qquad (23)$$


The behaviour of $\phi(\alpha)$ and $\psi(\alpha)$, displayed on figures 4 and 5, presents a
crossover at $\alpha \approx 2$, i.e. in the neighbourhood of the perceptron's capacity. At large
$\alpha$, $\phi(\alpha)$ is constant and $\psi(\alpha)$ decreases smoothly, whereas at small
$\alpha$, both quantities increase with $\alpha$. Thus, as a function of $P$, finite size corrections
to $\epsilon_g$ vanish more slowly at $\alpha < 2$ than at large $\alpha$. This is the reason why we
needed a larger number of samples for low $\alpha$ in our simulations.
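The extrapolation to $1/N \to 0$ and the extraction of the slope $\phi(\alpha)$ amount to a straight-line least-squares fit of $\epsilon_g(\alpha, N)$ against $1/N$. A minimal sketch follows; the numbers are synthetic values generated from the linear law $\epsilon_g(\alpha, N) = \epsilon_g(\alpha, \infty) - \phi(\alpha)/(\alpha N)$, not data from the simulations, and the names are hypothetical.

```python
def linear_fit(xs, ys):
    """Ordinary least-squares fit y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

# Synthetic illustration (made-up parameters, exact linear law).
alpha = 4.0
sizes = [20, 30, 50, 80, 100]
eps_inf_true, phi_true = 0.10, 0.30
inv_N = [1.0 / N for N in sizes]
eps = [eps_inf_true - phi_true / (alpha * N) for N in sizes]

eps_inf, slope = linear_fit(inv_N, eps)   # intercept: extrapolation to 1/N -> 0
phi = -slope * alpha                      # since P = alpha * N
```

The same fit applied to the sample variance against $1/N$ gives $\psi(\alpha)/\alpha$ as the slope, with an intercept expected to vanish.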

As $N$ decreases, the mean value of the generalization error distribution, $\epsilon_g(\alpha, N)$,
shifts towards lower values, proportionally to $1/N$. However, the broadening of the distribution,
$\sigma_g(\alpha, N) \sim 1/\sqrt{N}$, overcompensates this effect. Thus, in spite of the negative
correction to $\epsilon_g(\alpha, \infty)$ at finite $N$, there is a finite probability that a
particular trained perceptron generalizes worse than the theoretical prediction.

The distributions of stabilities follow the same trends as the generalization error. Histograms,
determined with some of our results, are compared to the theoretical density distributions on
figures 6 and 7. On figure 6, numerical results for both the student and the teacher perceptrons are
displayed. Although not clearly visible on the figure, the finite size teacher has fewer patterns at
small distances to the separating hyperplane than the theoretical distribution, the tail of the
distribution being slightly higher. These discrepancies are much smaller than the finite size effects
on the student perceptrons, which exhibit an increase of the pattern density closer to the
hyperplane, with a corresponding depletion of the peak at $\gamma_M$. These effects are enhanced at
smaller $N$, as may be seen on figure 7.



IV. CONCLUSION 

In this paper, we presented numerical simulations of the simplest neural network, the perceptron,
learning optimally a linear separation task from examples. They confirm the theoretical predictions
and present interesting finite size scaling behaviours.

After a derivation of the optimal learning potential, we deduced the theoretical distribution of
distances of the learned patterns to the separating hyperplane, $\rho(\gamma)$. Surprisingly, the
optimal student is predicted to be close to the boundary of the version space instead of being near
its center, as currently believed.

We presented extensive numerical simulations with the aim of clarifying to what extent the
theoretical results, which predict the typical behaviour of the generalization error and the
distribution of stabilities in the thermodynamic limit, are valid for finite size systems. In
particular, the numerically determined distribution of stabilities shows that finite size optimal
perceptrons lie even closer to the version space boundary than the theoretical prediction for
$N \to \infty$. The extrapolation of the generalization error $\epsilon_g$ to $1/N \to 0$, averaged
over a large number of samples, confirms the theoretical predictions with very high accuracy. The
variance of $\epsilon_g$ vanishes in that limit, showing that all the training sets endow the
perceptron with the same generalization error, with probability one. This is just what is meant by
the hypothesis of self-averaging underlying the replica approach, which is thus numerically
validated.

At finite $N$ the mean generalization error is smaller than the theoretical value. As the argument
that explains this result is independent of any learning scheme, since it takes into account only the
information content of the training set, we expect it to be also valid for statistical mechanics
predictions of $\epsilon_g$ for other learning algorithms. However, it is worth pointing out that the
width of the generalization error distribution grows with decreasing $N$ faster than the shift of the
mean value.

As a function of $\alpha$, $\epsilon_g(\alpha, N)$ shows two different scaling regimes, depending on
whether $\alpha > 2$ or $\alpha < 2$. The crossover at $\alpha_c = 2$ might be correlated to the
perceptron capacity. As below $\alpha_c$ any training set is expected to be linearly separable, it
seems likely that the generalization error presents a different scaling at $\alpha < \alpha_c$.
Theoretical calculations of finite size corrections remain to be done, to clarify the observed
scaling regimes.

Although the simulations were done for binary random input vectors, the behaviour of the
generalization error should be the same for continuous input vectors whose components have zero mean
and unit variance, as the theoretical results only depend on the first two moments of the pattern
distribution. It would be interesting to see whether the observed crossover at $\alpha \approx 2$
persists in this case.
FIG. 1. Distribution of distances of the training patterns to the Bayesian separating hyperplane
for different values of $\alpha$.

FIG. 2. Average generalization error vs. $1/N$. Error bars are not visible at the scale of the
figure. Lines are least-squares fits to the numerical data, which are extrapolated to $1/N \to 0$.
Full symbols correspond to the theoretical values.

FIG. 3. Variance of the generalization error vs. $1/N$. Lines are least-squares fits to the
numerical data.

FIG. 4. Slope of the finite size corrections to the generalization error.

FIG. 5. Slope of the finite size variance of the generalization error.

FIG. 6. Theoretical and numerical ($N = 100$) distribution of stabilities for the optimal student
and the teacher for $\alpha = 4$.

FIG. 7. Theoretical and numerical ($N = 20$ and $65$) distribution of stabilities for the optimal
student for $\alpha = 6$.



[Figure plots not reproduced. Axis labels recoverable from the source: $100\,\epsilon_g(\alpha, N)$ versus $1/N$; $100\,\phi(\alpha)$ versus $\alpha$; a rescaled $\psi(\alpha)$ versus $\alpha$; $\rho(\gamma)$ versus $\gamma$.]



